About the project

Write a short description about the course and add a link to your github repository here. This is an R markdown (.Rmd) file so you can use R markdown syntax. See the ‘Useful links’ page in the mooc area (chapter 1) for instructions.

Chapter 1

Description

Introduction to Open Data Science is a course held in 2017 for the second time. It promises to give participants all the skills needed to call themselves data scientists. Bold.

Regression and model validation

Describe the work you have done this week and summarize your learning.

This week I did the regression homework. Having done the DataCamp exercises, there was very little new in the RStudio exercise itself; it seems a bit pointless, as you could pretty much copy-paste the code from DataCamp and be done. To make things a bit more challenging for myself I added some complexity not requested in the exercise.

Data wrangling I had done before, so that was not new, but it was a good reminder. I had done regression with R before, but the validation and prediction material was new. I also found that rather than coding directly in the .Rmd file, one should code in a .R file first; it seems easier, at least.

Analysis part

  1. Read data to local folder
  2. Show a graphical overview
  3. Choose explanatory variables
  4. Explain the relationship between the variables.
  5. Check the model's validity.

1.

In this part the data is loaded into a data frame. Information about the dataset can be found here. In brief, it has data on students: their attitudes, learning styles, basic demographics and exam scores.

# Read data from a local comma separated values file. 
learning2014 <- read.csv("data/learning2014.csv")

# Get the dimensions of the data (166 students, 8 variables) and its structure.
dim(learning2014)
## [1] 166   8
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# The dataset has been collapsed from a dataset measured by Kimmo Vehkalahti. 
# Deep, stra and surf refer to deep, strategic and surface learning values normalized to 1-5.

2.

Do some graphs and tables.

# Show a graphical overview of the data and show summaries of the variables in the data. X is the numeric order of the students.
library(ggplot2)
library(GGally)
# Text table of the dataset
summary(learning2014)
##        X          gender       age           attitude          deep      
##  Min.   :  1.00   F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  1st Qu.: 42.25   M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Median : 83.50           Median :22.00   Median :3.200   Median :3.667  
##  Mean   : 83.50           Mean   :25.51   Mean   :3.143   Mean   :3.680  
##  3rd Qu.:124.75           3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##  Max.   :166.00           Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points))

# define the visualization type (points)
p2 <- p1 + geom_point()

# add a regression line
p3 <- p2 + geom_smooth(method = "lm")

# add a main title and draw the plot
p4 <- p3 + ggtitle("Student's attitude versus exam points")

# add equation and r-squared to the graph after linear model has been done.


# create a more advanced plot matrix with ggpairs()
pp <- ggpairs(learning2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
pp

shapiro.test(learning2014$deep)
## 
##  Shapiro-Wilk normality test
## 
## data:  learning2014$deep
## W = 0.9812, p-value = 0.02371
shapiro.test(learning2014$stra)
## 
##  Shapiro-Wilk normality test
## 
## data:  learning2014$stra
## W = 0.99068, p-value = 0.3504
shapiro.test(learning2014$surf)
## 
##  Shapiro-Wilk normality test
## 
## data:  learning2014$surf
## W = 0.99219, p-value = 0.5073
shapiro.test(learning2014$points)
## 
##  Shapiro-Wilk normality test
## 
## data:  learning2014$points
## W = 0.96898, p-value = 0.0008932

Description of the data: there are more females than males. The mean age of the students is 25.5 years, with a range of 17-55. The mean attitude is 3.1. The strategy attributes look roughly normally distributed, but that is not the case for deep and points. Normality was assessed with the Shapiro-Wilk test: if it gives p < 0.05, normality usually cannot be assumed. Most of the attributes do not correlate much. The biggest correlations are between points and the deep learning strategy, for both genders. Also, among males (but not females), surface learning had a strong negative correlation with the deep learning strategy.
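As a side note, a small simulated sketch (not part of the exercise, numbers made up) shows how the Shapiro-Wilk test behaves on data that is and is not normal:

```r
# Illustrative sketch: shapiro.test() on simulated samples.
# set.seed() makes the random draws reproducible.
set.seed(2017)

normalSample <- rnorm(166)   # drawn from a normal distribution
skewedSample <- rexp(166)    # drawn from a clearly non-normal (exponential) one

shapiro.test(normalSample)$p.value  # typically well above 0.05
shapiro.test(skewedSample)$p.value  # far below 0.05
```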

3.

Explore how well different student attributes predict their exam points.

## Fit an explanatory regression model. Choose three variables. Drop the non-significant ones and refit.
# Do a linear model.
lModel <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(lModel)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# Surface learning is not even close to significant. The strategic learning is marginally significant. So drop surf.
lModel <- lm(points ~ attitude + stra, data = learning2014)
summary(lModel)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
# After dropping surf, strategic learning is also p > 0.05, so drop it too. Only attitude remains.
lModel <- lm(points ~ attitude, data = learning2014)
summary(lModel)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# Add regression formula to the attitude vs points plot.
# Function to get regression formula.
lm_eqn <- function(m) {
  
  l <- list(a = format(coef(m)[1], digits = 2),
            b = format(abs(coef(m)[2]), digits = 2),
            r2 = format(summary(m)$r.squared, digits = 3))
  
  if (coef(m)[2] >= 0) {
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, l)
  } else {
    eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2, l)
  }
  
  as.character(as.expression(eq))
}

p5 <- p4 + annotate("text", x = 2.1, y = 34, label = lm_eqn(lModel), colour = "black", size = 5, parse = TRUE)
p5

Linear regression can be described with the formula: dependent = alpha + beta × explanatory variable + noise. One can have more than one explanatory variable. In this case only attitude remains as a statistically significant explanatory variable. In the current model alpha is 11.6 and beta is 3.5; that is, each attitude point is worth about 3.5 points in the exam.

4.

Interpretation.

The Multiple R-squared value is 0.19. Multiple R-squared is usually just called R-squared, and it tells how well the model fits the data: a value of 1 means the model predicts the data perfectly, whereas 0 means it has no predictive value. A value of 0.19 means that roughly 19 percent of the variance in exam points in the learning2014 dataset is explained by the attitude rating. It can help to remember that in simple regression R-squared is the correlation coefficient multiplied by itself. If one were using multiple explanatory variables, the Adjusted R-squared value would be the recommended measure.
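The R-squared-equals-squared-correlation fact is easy to verify with a toy example (numbers invented, not from the course data):

```r
# Made-up example: in simple one-predictor regression, Multiple R-squared
# equals the squared Pearson correlation of the two variables.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8)

m <- lm(y ~ x)
summary(m)$r.squared  # R-squared from the model
cor(x, y)^2           # squared correlation: the same number
```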

5.

Validation of model sanity.

## Check if model fits well or not. 
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5.
# Draw three panels in a single figure.
par(mfrow = c(1,3))
# Plot Residuals vs Fitted values (1), Normal QQ-plot (2), and Residuals vs Leverage (5).
plot(lModel, which = c(1,2,5))

A linear regression model assumes a linear relationship between the variables and normally distributed errors. The errors should also have constant variance. One can draw diagnostic graphs to check how well the model meets these assumptions.

The Residuals vs Fitted plot is used to check whether the variance of the errors is constant. If the points are roughly evenly scattered, that is good; if they form a clear shape, that is bad. Here the points are roughly evenly scattered, so the size of the errors should not depend on the explanatory variable.

The Q-Q plot is used to check whether the errors are normally distributed: if the points follow the line, normality can be presumed. Here the points stay reasonably close to the line, so the assumption of normally distributed errors seems justified.

The Residuals vs Leverage plot is used to check for influential observations. If there are observations with particularly high leverage, one should consider removing them from the analysis. Here the leverage plot shows no such outliers (no very high leverage observations).

All in all the model seems reasonable.
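Out of curiosity, the same checks can also be done numerically. This sketch uses made-up data, and the 2p/n leverage threshold is just one common rule of thumb:

```r
# Made-up data: numeric counterparts to the diagnostic plots.
set.seed(1)
d <- data.frame(x = 1:50)
d$y <- 2 + 3 * d$x + rnorm(50, sd = 4)
m <- lm(y ~ x, data = d)

# Normality of errors: Shapiro-Wilk on the residuals.
shapiro.test(resid(m))$p.value

# Leverage: the hat values always sum to the number of coefficients,
# and a common rule of thumb flags observations above 2 * p / n.
p <- length(coef(m))
n <- nrow(d)
sum(hatvalues(m))                # equals p (here 2)
which(hatvalues(m) > 2 * p / n)  # no flagged observations in this toy data
```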


Chapter 3. Logistic regression

Analysis part

  1. Read data to local folder
  2. Choose interesting variables
  3. Explore the distributions of the selected variables
  4. Use logistic regression to explore high alcohol use and the chosen variables
  5. Explore the predictive power of the created model
  6. Bonus 10-fold cross-validation

1.

# Load required libraries
library(ggplot2)
library(dplyr)
library(boot)

# Load data from .csv
alc <- read.csv("data/student-alc.csv")

# Column names
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"

Description of the data from the authors (truncated): This data approaches student achievement in secondary education in two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modelled under binary/five-level classification and regression tasks. G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.

Link to the data and a fuller description.

2.

Choose four interesting variables and present hypotheses about their relationship with alcohol consumption. I did not select the variables used in DataCamp, as their results are already known, except for the final grade, which is an obvious choice.

19 activities - extra-curricular activities (binary: yes or no). I would imagine that having activities predicts low alcohol use. Then again it might not: for example, junior ice-hockey players in Finland consume a lot of alcohol. It will be interesting to see if there is any relation to high use; I would guess a marginally significant one.

21 higher - wants to take higher education (binary: yes or no). Those with ambition should use less alcohol, if they are serious about it, although merely wanting to take higher education is easy. Again, it will be interesting to see any relation to high use; I would guess a weak but still significant one.

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). I would imagine that poor family relationships predict high alcohol use. Of my selections, I would guess this one has the strongest relation to high alcohol use.

32 G3 - final grade (numeric: from 0 to 20, output target). The final grade should have some predictive value, but not a very strong one, at least as far as I remember from the DataCamp exercise. This selection is a bit boring.

3.

## Do some exploratory search with the chosen variables and alcohol use. 
# Draw a bar plot of high_use by activities. Gives graphs divided by activities.
g1 <- ggplot(alc, aes(x = high_use))
g1 + geom_bar() + facet_wrap("activities") + ggtitle("Students divided by extra activities (no/yes) to high users (F/T)") + xlab("High alcohol use (False/True)")

19. Activities. It looks like students with extra activities include slightly fewer heavy alcohol users, but the difference between active and non-active students is not large.

# Draw a bar plot of high education aspirations and high alcohol use.
g2 <- ggplot(alc, aes(x = high_use))
g2 + geom_bar() + facet_wrap("higher") + ggtitle("Students divided by wanting to take higher education (no/yes) to high users (F/T)") + xlab("High alcohol use (False/True)")

# 21. Higher education. There are very few students not wanting to take higher education, so this variable will likely be of little use in further analyses.

alc %>% group_by(higher, high_use) %>% summarise(count = n())
## # A tibble: 4 x 3
## # Groups:   higher [?]
##   higher high_use count
##   <fctr>    <lgl> <int>
## 1     no    FALSE     9
## 2     no     TRUE     9
## 3    yes    FALSE   259
## 4    yes     TRUE   105

There are only 18 students out of 382 who are not interested in higher education, which is a bit surprising. High use was split 1:1 among those not wanting higher education, so my hypothesis holds, kind of; the n is so small that one should not draw statistical inferences. The ratio among those wanting to take higher education is about 1:2.5 (105 high users to 259 others), clearly below 1:1.

# 24. Family relationship. 
# Do a histogram of relationship levels by high alcohol use.
g3 <- ggplot(alc, aes(x = high_use))
g3 + geom_bar() + facet_wrap("famrel") + ggtitle("Students divided by family relationships (1-5) to high users (F/T)") + xlab("High alcohol use (False/True)")

There seem to be proportionally more high alcohol users among those reporting poor family relationships. Fortunately, those reporting very poor relationships are a minority. The hypothesis stands strong.

# 32. Final grade.
# Do boxplots of high and low alcohol users and their final grades (G3).
g4 <- ggplot(alc, aes(x = high_use, y = G3))
g4 + geom_boxplot() + ggtitle("Students' final grades grouped by alcohol use") + xlab("High alcohol use (False/True)") + ylab("Final grade")

There is a small difference between the high and low users: high users of alcohol have worse grades on average than low users. There is a large overlap between the groups, though.

4.

## Doing the logistic regression model.
lrModel <- glm(high_use ~ activities + higher + famrel + G3, data = alc, family = "binomial")

summary(lrModel)
## 
## Call:
## glm(formula = high_use ~ activities + higher + famrel + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3793  -0.8390  -0.7433   1.3588   1.9027  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    1.48228    0.71570   2.071   0.0383 *
## activitiesyes -0.16942    0.22912  -0.739   0.4596  
## higheryes     -0.53716    0.51831  -1.036   0.3000  
## famrel        -0.24829    0.12285  -2.021   0.0433 *
## G3            -0.06856    0.03592  -1.909   0.0563 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 453.42  on 377  degrees of freedom
## AIC: 463.42
## 
## Number of Fisher Scoring iterations: 4

The model has only two variables that seem worthwhile: family relationships and the final grade. Activities and higher-education aspirations are unlikely to be significant predictors of high alcohol use. In the model, good family relationships and a good final grade predict lower alcohol use.

# Compute odds ratios and confidence intervals from the model's coefficients
oddsR <- coef(lrModel) %>% exp
confI <- confint(lrModel) %>% exp
## Waiting for profiling to be done...
# Print odds ratios and confidence intervals. The confidence intervals of activities, higher education aspirations, and final grades span 1, meaning that there might be no predictive value with those variables. The family relations almost get to 1 also. 
cbind(oddsR, confI)
##                   oddsR     2.5 %     97.5 %
## (Intercept)   4.4029534 1.0892712 18.2873241
## activitiesyes 0.8441515 0.5383786  1.3236738
## higheryes     0.5844052 0.2094452  1.6406627
## famrel        0.7801305 0.6125070  0.9932711
## G3            0.9337328 0.8697591  1.0017032

The hypotheses for family relationships and final grades still hold, but rather tenuously once one considers the confidence intervals. Final grades have an odds ratio of 0.93 (roughly 1:1.07), so even if it were marginally significant it is of little practical value. I'd say only the family relationships variable has any value.
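To make the odds ratios concrete, here is a small sketch of what the famrel odds ratio of roughly 0.78 implies; the odds-to-probability conversion is the standard formula:

```r
# Each one-step improvement in famrel multiplies the odds of high use
# by about 0.78 (rounded from the odds ratio table above).
or_famrel <- 0.78

# Going from famrel = 1 to famrel = 5 is four steps:
or_famrel^4          # about 0.37, i.e. the odds drop to roughly a third

# Odds convert to probabilities with p = odds / (1 + odds):
oddsToProb <- function(odds) odds / (1 + odds)
oddsToProb(1)        # even odds correspond to probability 0.5
```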

5.

Explore the predictive power of the model with the significant variables. I'm including only family relationships, as the final grades were only marginally significant in the model and their confidence interval also spans 1.

# Refit the model with only family relationships
strippedModel <- glm(high_use ~ famrel, data = alc, family = "binomial")

# Predict the high use of alcohol using the stripped down model and the original (for interests sake).
predHighStripped <- predict(strippedModel, type = "response")
predHighAll <- predict(lrModel, type = "response")

# Add the probabilities to data structure alc
alc <- mutate(alc, probStripped = predHighStripped)
alc <- mutate(alc, probAll = predHighAll)

# Make predictions of high use
alc <- mutate(alc, predStripped = probStripped > 0.5)
alc <- mutate(alc, predAll = probAll > 0.5)

# Check that code was fine. Look at the first twenty values
select(alc, famrel, high_use, probStripped, predStripped, probAll, predAll) %>% head(20)
##    famrel high_use probStripped predStripped   probAll predAll
## 1       4    FALSE    0.2927178        FALSE 0.3551263   FALSE
## 2       5    FALSE    0.2410230        FALSE 0.3005091   FALSE
## 3       4     TRUE    0.2927178        FALSE 0.3095389   FALSE
## 4       3    FALSE    0.3503816        FALSE 0.2831074   FALSE
## 5       4    FALSE    0.2927178        FALSE 0.2950794   FALSE
## 6       5    FALSE    0.2410230        FALSE 0.1937715   FALSE
## 7       4    FALSE    0.2927178        FALSE 0.2950794   FALSE
## 8       4    FALSE    0.2927178        FALSE 0.3243810   FALSE
## 9       4    FALSE    0.2927178        FALSE 0.2171708   FALSE
## 10      5    FALSE    0.2410230        FALSE 0.1937715   FALSE
## 11      3    FALSE    0.3503816        FALSE 0.3492025   FALSE
## 12      5    FALSE    0.2410230        FALSE 0.2160970   FALSE
## 13      4    FALSE    0.2927178        FALSE 0.2480893   FALSE
## 14      5    FALSE    0.2410230        FALSE 0.2461718   FALSE
## 15      4    FALSE    0.2927178        FALSE 0.2413852   FALSE
## 16      4    FALSE    0.2927178        FALSE 0.2413852   FALSE
## 17      3    FALSE    0.3503816        FALSE 0.2831074   FALSE
## 18      5    FALSE    0.2410230        FALSE 0.2160970   FALSE
## 19      5     TRUE    0.2410230        FALSE 0.2937649   FALSE
## 20      3    FALSE    0.3503816        FALSE 0.3266436   FALSE
table(high_use = alc$high_use, prediction = alc$predStripped)
##         prediction
## high_use FALSE
##    FALSE   268
##    TRUE    114
# Hmm. There seem to be no high users predicted by the stripped-down model. Count the number of TRUE values in predStripped to make sure.
sum(alc$predStripped)
## [1] 0
# No True values... Let's see then how the full model would fare.
table(high_use = alc$high_use, prediction = alc$predAll)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE    112    2
# Better, but not great. The sensitivity of the model with all variables is terrible (2/114), and the specificity is fine (258/268). 

# Let's compute the total proportion of inaccurately classified individuals, for hoots and giggles. Using the full model, as the stripped version would be a bit boring to test.
nIndividuals <- nrow(alc)
nIncorrectPrediction <- sum(alc$high_use != alc$predAll)
nIncorrectPrediction/nIndividuals
## [1] 0.3193717
# The proportion of inaccurate predictions was 32 % (122 out of 382). This gives an inflated view of how good the model is, as it mostly reflects the model making very few high-use predictions at all.

# Let's compare the model to a simple strategy of if family relationship is 2 or less, then predict high alcohol use.
alc$simpleClassify <- alc$famrel <= 2
table(high_use = alc$high_use, prediction = alc$simpleClassify)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE    103   11
nIncorrectPrediction <- sum(alc$high_use != alc$simpleClassify)
nIncorrectPrediction/nIndividuals
## [1] 0.3115183

The proportion of inaccurate predictions was 31 % (119 out of 382), so the simple classifier was better than the full model, but not by much. Its sensitivity (11/114, roughly 10 %) was several times that of the full model (2/114), but that is still far too low to be impressive.
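For reference, these standard metrics can be computed directly from the 2x2 table; the counts below are taken from the simple classifier's table above:

```r
# Counts from the simple famrel <= 2 classifier's confusion table.
tn <- 252  # low use,  predicted low  (true negatives)
fp <- 16   # low use,  predicted high (false positives)
fn <- 103  # high use, predicted low  (false negatives)
tp <- 11   # high use, predicted high (true positives)

sensitivity <- tp / (tp + fn)                   # 11 / 114, about 0.10
specificity <- tn / (tn + fp)                   # 252 / 268, about 0.94
accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # about 0.69
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```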

6.

Perform a 10-fold cross-validation of the full model. One needs to define a loss function for the cross-validation function cv.glm().

# Define loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$predAll)
## [1] 0.3193717
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = lrModel, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3272251

My full model performs worse than the model in DataCamp: roughly 0.33 vs 0.26, lower values being better.
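As a sanity check of my understanding, this sketch does manually roughly what cv.glm() does internally. It uses the built-in mtcars data (with am as the binary outcome) so it does not depend on the course files:

```r
# Manual K-fold cross-validation sketch on built-in data (mtcars).
set.seed(7)
K <- 5
data(mtcars)

# Assign each row randomly to one of K folds.
folds <- sample(rep(1:K, length.out = nrow(mtcars)))

errors <- numeric(K)
for (k in 1:K) {
  train   <- mtcars[folds != k, ]
  holdout <- mtcars[folds == k, ]
  fit  <- glm(am ~ wt, data = train, family = "binomial")
  prob <- predict(fit, newdata = holdout, type = "response")
  # Proportion of wrong predictions on the held-out fold.
  errors[k] <- mean((prob > 0.5) != holdout$am)
}
mean(errors)  # average proportion of wrong predictions across folds
```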


Chapter 4. Clustering and classification

Analysis part

The Boston dataset has 506 observations and 14 variables. It contains information about housing in the suburbs of Boston.

Description of variables in the Boston dataset:

  1. crim - per capita crime rate by town.

  2. zn - proportion of residential land zoned for lots over 25,000 sq.ft.

  3. indus - proportion of non-retail business acres per town.

  4. chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  5. nox - nitrogen oxides concentration (parts per 10 million).

  6. rm - average number of rooms per dwelling.

  7. age - proportion of owner-occupied units built prior to 1940.

  8. dis - weighted mean of distances to five Boston employment centres.

  9. rad - index of accessibility to radial highways.

  10. tax - full-value property-tax rate per $10,000.

  11. ptratio - pupil-teacher ratio by town.

  12. black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

  13. lstat - lower status of the population (percent).

  14. medv - median value of owner-occupied homes in $1000s.

2. Load Boston data from MASS package

# Load required libraries
library(ggplot2)
library(dplyr)
library(corrplot)
library(GGally)
library(tidyr)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data('Boston')

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

3. Graphical overview of the data.

pairs(Boston)

# print the correlation matrix
cor_matrix<-cor(Boston) 
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# Histograms of the variables
Boston %>% 
  gather(key=var_name, value = value) %>% 
  ggplot(aes(x=value)) +
  geom_histogram() +
  facet_wrap(~var_name, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create color ramp from dark blue to white to red.
colorVector <- c("blue", "white", "red")

# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, col = colorRampPalette(colorVector)(200))

The histograms show that much of the data does not look normally distributed: most of the variables have high kurtosis or two peaks. The correlation plot shows many strong positive correlations between variables, such as non-retail industry with nitrogen oxide concentration and with the property-tax rate, and property taxes with access to radial highways. There are also some strong negative correlations, such as between the median value of owner-occupied homes and the lower status of the population, and between the distances to five Boston employment centres and the proportion of owner-occupied units built prior to 1940.
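Rather than reading the strongest pairs off the plot, one can also pick them out programmatically; this snippet recomputes the correlation matrix so it is self-contained:

```r
# Find the strongest pairwise correlations in the Boston data.
library(MASS)
data("Boston")

cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each variable pair only once

# Flatten to (Var1, Var2, Freq) rows, drop NAs, sort by absolute correlation.
pairs_df <- subset(as.data.frame(as.table(cm)), !is.na(Freq))
head(pairs_df[order(-abs(pairs_df$Freq)), ], 5)
```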

4. Standardize the dataset. Create crime rate categorical variable. Split data into training and data parts.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame from matrix type.
boston_scaled <- as.data.frame(boston_scaled)

sd(boston_scaled$crim)
## [1] 1
# Now all the variables have zero mean and SD of 1. This is recommended for clustering. 
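
As a quick sanity check, scale() is just centering by the column mean and dividing by the column standard deviation; a minimal sketch with made-up numbers:

```r
# scale() centers each column by its mean and divides by its
# standard deviation; a manual computation gives the same numbers.
x <- c(1, 2, 3, 4, 10)
manual <- (x - mean(x)) / sd(x)
scaled <- as.vector(scale(x))
all.equal(manual, scaled)  # TRUE
```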


# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime. The bins have either 127 or 126 elements.
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)


# Split the data into a training set (80 %) and a test set (20 %).
# number of rows in the Boston dataset 
n <- nrow(Boston)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create training set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
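
The split above is random, so results shift between knits. One fix is to call set.seed() before sample(), as done later for k-means; a minimal sketch (the seed value 2017 is arbitrary):

```r
# With the same seed, sample() draws the same indices every time,
# so train/test splits (and everything downstream) are reproducible.
set.seed(2017)
ind1 <- sample(10, size = 8)
set.seed(2017)
ind2 <- sample(10, size = 8)
identical(ind1, ind2)  # TRUE
```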

5. Fit linear discriminant analysis on the training set.

# linear discriminant analysis. The . means all of the variables.
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object. Note that the first linear discriminant captures 95 % of the between-group variance (the proportion of trace), the third only 1.2 %.
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2301980 0.2524752 0.2574257 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.93383062 -0.9191416 -0.08484810 -0.8722894  0.43137209
## med_low  -0.04532905 -0.3554010 -0.01832260 -0.6124126 -0.09266427
## med_high -0.38467135  0.1859628  0.15226017  0.4378437  0.08258050
## high     -0.48724019  1.0170690 -0.04518867  1.0678298 -0.36764984
##                 age        dis        rad        tax    ptratio
## low      -0.9248669  0.9105139 -0.6854573 -0.7331945 -0.4492845
## med_low  -0.3403051  0.4434687 -0.5348335 -0.5038331 -0.1691896
## med_high  0.4133696 -0.3770553 -0.4301567 -0.3137256 -0.2239989
## high      0.8062596 -0.8403767  1.6386213  1.5144083  0.7813507
##                black        lstat        medv
## low       0.37472057 -0.772276328  0.52579215
## med_low   0.30499367 -0.169374658  0.04354925
## med_high  0.09350318 -0.006730546  0.12893419
## high     -0.76874314  0.828148564 -0.68239693
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.12092557  0.55380084 -1.03278014
## indus    0.04741820 -0.20883528  0.38731713
## chas    -0.11420221  0.01310448  0.14283731
## nox      0.36476296 -0.83707637 -1.44571956
## rm      -0.08580832 -0.12362086 -0.22276278
## age      0.21790722 -0.40554794  0.13898853
## dis     -0.09451281 -0.18490790  0.34854195
## rad      3.31008817  1.02107442  0.11659879
## tax      0.11838268 -0.01230347  0.36478142
## ptratio  0.13837825 -0.08799499 -0.52089679
## black   -0.08428539  0.01672739  0.09501976
## lstat    0.18476558 -0.11298308  0.47375165
## medv     0.18155063 -0.35056803 -0.06434031
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9521 0.0358 0.0122
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (both lines need to be run together). The high crime-rate class clusters quite well: only a few medium-high elements fall within it, and only one high-class element lies near an LD1 value of ~0.
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

6. Predict the classes with the LDA model.

# The crime variable was already saved as correct_classes and removed from the test data in step 4, so neither step needs repeating here.
# test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       5        2    0
##   med_low    6      15       12    0
##   med_high   0       5       17    2
##   high       0       0        0   23

The classification of the low and especially the high crime-rate classes was quite successful. The medium-low and medium-high classes were only about 60 % correct, and there was more confusion among the low crime-rate observations than among the high ones. The exact numbers shift between runs because the train/test split is random, which is annoying.
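
The percentages quoted here can be computed directly from the confusion matrix. A sketch using the numbers from this particular run (they shift with each random split):

```r
# Confusion matrix from this run's cross tabulation (rows = correct class).
conf <- matrix(c(15,  5,  2,  0,
                  6, 15, 12,  0,
                  0,  5, 17,  2,
                  0,  0,  0, 23),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

sum(diag(conf)) / sum(conf)   # overall accuracy, about 0.69
diag(conf) / rowSums(conf)    # per-class accuracy
```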

7. Start again with the Boston dataset. Determine a sensible number of clusters for K-means clustering.

# Reload Boston dataset.
data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame from matrix type.
boston_scaled <- as.data.frame(boston_scaled)

# Calculate the Euclidean distances between observations.
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
# K-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# Determine how many clusters to use via the within-cluster sum of squares (WCSS). Set a seed so repeated runs give the same result.
set.seed(123)

# determine the number of clusters using the total within sum of squares (twcss)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results. The elbow suggests 2 or 3 clusters; 2 is handier for visualization, so let's go with that.
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the normalized Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# Tried ggpairs as well but did not manage to color the points by cluster. ggpairs does show the correlation values and distributions, though.
ggpairs(boston_scaled)

Per capita crime rate by town (crim) correlates most strongly with accessibility to radial highways (rad, 0.63), the full-value property-tax rate (tax, 0.58), and the share of lower-status population (lstat, 0.46). rad and tax are very highly correlated with each other (0.91), and both correlate fairly strongly with lstat (~0.45). Since rad and tax are so entangled, it is hard to say whether good connections or the tax base better explains crime in an area. I would have expected lower status to correlate negatively with tax, so the tax rate is likely not a good indicator of an area's wealth. These correlations are hard to interpret with confidence. One notable value is that the proportion of black residents by town (black) has a weak negative correlation with the per capita crime rate.
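
The correlations quoted here can be pulled straight from the correlation matrix; a sketch (assuming MASS is loaded for the Boston data, as earlier in the chapter):

```r
# Sort the crim row of the correlation matrix to rank its correlates.
library(MASS)
crim_cor <- cor(Boston)["crim", ]
sort(crim_cor[names(crim_cor) != "crim"], decreasing = TRUE)
# rad (0.63), tax (0.58) and lstat (0.46) top the list
```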

Super bonus, 3D plots.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)

# Plot with crime classes as color of the training data.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# Plot with clusters of k-means as color of the training data.
# To compare the methods, first run k-means with 4 clusters, matching the number of crime classes.
km3D <- kmeans(boston_scaled, centers = 4)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km3D$cluster[ind])

Comparing the methods by eye-balling: the high crime-rate class and the third k-means cluster are rather similar. The k-means clustering looks better visually, with less intermingling; the crime classes mix to the point where the low and medium-high classes are interspersed. Cluster 1 roughly matches the low class, but clusters 2 and 4 do not match the medium-low and medium-high classes well. This held for at least one run of the file, though perhaps not for another. Note that you may need to click on the figures before anything is shown.
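
Eye-balling can be complemented with a cross tabulation of the crime classes against the k-means cluster labels. A reproducible sketch on the full scaled data (crim is dropped from the clustering here so the class variable itself does not drive the clusters, which differs slightly from the comparison above):

```r
library(MASS)
set.seed(123)
boston_scaled <- as.data.frame(scale(Boston))

# the same quantile-based crime classes as before
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# k-means with 4 clusters, excluding crim itself
km4 <- kmeans(boston_scaled[, names(boston_scaled) != "crim"], centers = 4)

# rows = crime class, columns = cluster; a good match shows one
# dominant cluster per class
table(class = crime, cluster = km4$cluster)
```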


Chapter 5. Dimensionality reduction techniques

Load necessary libraries

# Load required libraries
library(ggplot2)
library(dplyr)
library(corrplot)
library(GGally)
library(tidyr)

1. Load human data

human <- read.csv("data/human.csv", header = TRUE, row.names = 1)

# Look at the data
str(human)
## 'data.frame':    155 obs. of  7 variables:
##  $ eduFemDivM     : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labFemDivM     : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ educExpect     : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ lifeExpect     : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ grossIncome    : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ motherMortality: int  4 6 6 5 6 7 9 28 11 8 ...
##  $ womenParliament: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   7

There are 155 countries in the human data frame with seven variables. The variables are:

eduFemDivM: Ratio of females to males educated at secondary and higher education levels

labFemDivM: Ratio of female to male labour-force participation rates

educExpect: Expected years of schooling

lifeExpect: Life expectancy at birth (years)

grossIncome: Gross national income (GNI) per capita

motherMortality: Maternal mortality ratio (MMR, deaths per 100 000 live births)

womenParliament: Share of parliamentary seats held by women (% of seats)

Further information can be found at http://hdr.undp.org/en/content/human-development-index-hdi

2. Graphical overview of the data.

Show summaries.

summary(human)
##    eduFemDivM       labFemDivM       educExpect      lifeExpect   
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##   grossIncome     motherMortality  womenParliament
##  Min.   :   581   Min.   :   1.0   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :57.50
# Histograms of the variables
human %>% 
  gather(key=var_name, value = value) %>% 
  ggplot(aes(x=value)) +
  geom_histogram() +
  facet_wrap(~var_name, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the variables are not normally distributed. GNI and maternal mortality are heavily right-tailed, with most of the values being low. Expected years of schooling, women in parliament, and women's labour-force participation are roughly normally distributed, although with high kurtosis. The female-to-male education ratio and life expectancy have multiple peaks and more complicated distributions.
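
The heavy tails can be quantified with a skewness estimate. A base-R sketch using the moment estimator m3 / m2^(3/2), so no extra packages are needed:

```r
# Moment-based sample skewness: 0 for symmetric data,
# large and positive for right-tailed variables like GNI.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skewness(c(-2, -1, 0, 1, 2))  # 0 for a symmetric sample
skewness(c(1, 1, 1, 10))      # positive for a right-tailed sample
# e.g. sapply(human, skewness) would then rank the variables by asymmetry
```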

# Correlations of the variables.
# Create color ramp from dull dark blue to white to dull red.
colorVector <- c("#4477AA", "#77AADD", "#FFFFFF", "#EE9988", "#BB4444")

# Print the correlation matrix
corMatrix <- cor(human)
corMatrix %>% round(digits = 2)
##                 eduFemDivM labFemDivM educExpect lifeExpect grossIncome
## eduFemDivM            1.00       0.01       0.59       0.58        0.43
## labFemDivM            0.01       1.00       0.05      -0.14       -0.02
## educExpect            0.59       0.05       1.00       0.79        0.62
## lifeExpect            0.58      -0.14       0.79       1.00        0.63
## grossIncome           0.43      -0.02       0.62       0.63        1.00
## motherMortality      -0.66       0.24      -0.74      -0.86       -0.50
## womenParliament       0.08       0.25       0.21       0.17        0.09
##                 motherMortality womenParliament
## eduFemDivM                -0.66            0.08
## labFemDivM                 0.24            0.25
## educExpect                -0.74            0.21
## lifeExpect                -0.86            0.17
## grossIncome               -0.50            0.09
## motherMortality            1.00           -0.09
## womenParliament           -0.09            1.00
# Visualize the correlation matrix


corrplot(corMatrix, method = "color", col = colorRampPalette(colorVector)(200),
         type = "upper", order = "hclust", number.cex = .8,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 30, # Text label color and rotation
         # Combine with significance
         #p.mat = p.mat, sig.level = 0.01, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag = FALSE)

The highest positive correlation was between expected years of schooling and life expectancy (0.79). GNI, expected schooling, and life expectancy all correlate quite strongly with one another. Maternal mortality is strongly negatively correlated with life expectancy, expected schooling, and the female-to-male education ratio. Women in parliament and the female-to-male labour-force ratio correlate only weakly with the other variables.

3. Perform PCA on non-standardized human data.

Examine the results.

# perform principal component analysis (with the SVD method)
pcaHuman <- prcomp(human)

# print out a summary of the PCA (the biplot later gives quite a few warnings). The first component explains a whopping 99.99 % of the variance.
s <- summary(pcaHuman)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5    PC6    PC7
## Standard deviation     1.854e+04 184.0810 11.46 3.796 1.598 0.1912 0.1592
## Proportion of Variance 9.999e-01   0.0001  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 
## 100   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot of the principal component representation and the original variables using the first 2 components. GNI looks to explain pretty much all of the first principal component.
biplot(pcaHuman, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], xlim = c(-0.5, 0.2))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

4. Perform PCA on standardized human data.

Examine and compare to the non-standardized data.

humanStand <- scale(human)

pcaHumanStand <- prcomp(humanStand)

# print out a summary of the PCA on the standardized data.
s2 <- summary(pcaHumanStand)
s2
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.9026 1.1365 0.8749 0.77835 0.63018 0.45962
## Proportion of Variance 0.5171 0.1845 0.1094 0.08655 0.05673 0.03018
## Cumulative Proportion  0.5171 0.7016 0.8110 0.89752 0.95425 0.98443
##                            PC7
## Standard deviation     0.33012
## Proportion of Variance 0.01557
## Cumulative Proportion  1.00000
# rounded percentages of variance captured by each PC.
pca_pr2 <- round(100*s2$importance[2, ], digits = 1)

# print out the percentages of variance. Now the variance is spread much more evenly: the first component explains 52 % of the variability and the next three together about 38 %.
pca_pr2
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7 
## 51.7 18.5 10.9  8.7  5.7  3.0  1.6
# create object pc_lab to be used as axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")

# draw a biplot of the principal component representation and the original variables using the first 2 components. After scaling, no single variable dominates the first principal component.
biplot(pcaHumanStand, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2], xlim = c(-0.25, 0.25))

# The results are quite different. Before, only one component was of any note; after scaling there are three or four meaningful components. Most of the difference should be due to scaling putting the variables on a common footing. Redo the histograms from the beginning to check this.

as.data.frame(humanStand) %>% 
  gather(key=var_name, value = value) %>% 
  ggplot(aes(x=value)) +
  geom_histogram() +
  facet_wrap(~var_name, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Actually, the distributions themselves are not that different. The key point is that unscaled PCA effectively treats large variance as importance, so GNI, whose values are orders of magnitude larger than the others', was given nearly all the weight. That makes little sense here, because the variables are measured in entirely different units.
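
The unit effect is easy to demonstrate with a toy example: two uncorrelated variables, one in units a thousand times larger. Without scaling the big-unit variable swallows PC1; with scaling the variance splits roughly evenly:

```r
# Toy data: 'b' has the same shape as 'a' but 1000x larger units.
set.seed(42)
toy <- data.frame(a = rnorm(100), b = 1000 * rnorm(100))

# proportion of variance on PC1, without and with scaling
p_unscaled <- summary(prcomp(toy))$importance[2, 1]
p_scaled   <- summary(prcomp(toy, scale. = TRUE))$importance[2, 1]
p_unscaled  # close to 1: b dominates
p_scaled    # close to 0.5: a and b contribute roughly equally
```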

5. Interpretation

With standardization the relationships between the variables are much easier to see. PC1 is composed mostly of expected schooling, GNI, the female-to-male education ratio, life expectancy, and maternal mortality. PC2 is composed mostly of women in parliament and the female-to-male labour-force ratio. The biplots are certainly easier to read after scaling, since the variables are on similar scales instead of wildly different ones. I would interpret PC1 as mostly the level of resources invested in people, such as medicine and schooling, and PC2 as some kind of equality measure of how easily women can participate in working life rather than staying at home.

6. And now for something completely different.

# Load tea dataset from Factominer.
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.3.3
data(tea)

# Explore the data. The tea dataset has 300 observations and 36 variables.
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# Since there are so many variables one needs to split them for visualization.
gather(tea[1:12]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

gather(tea[13:24]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

gather(tea[25:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

# Perform multiple correspondence analysis on the tea data. Some of the columns seem to give errors, so keep only a subset of variables.
keep_columns <- c("Tea", "how", "sugar", "where", "lunch", "exciting", "price", "Sport")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.264   0.217   0.159   0.146   0.139   0.132
## % of var.             14.065  11.566   8.481   7.801   7.420   7.037
## Cumulative % of var.  14.065  25.632  34.113  41.913  49.333  56.370
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.124   0.116   0.111   0.106   0.093   0.087
## % of var.              6.640   6.189   5.903   5.675   4.958   4.665
## Cumulative % of var.  63.010  69.199  75.102  80.777  85.735  90.401
##                       Dim.13  Dim.14  Dim.15
## Variance               0.071   0.064   0.045
## % of var.              3.800   3.401   2.399
## Cumulative % of var.  94.200  97.601 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                    | -0.356  0.160  0.033 |  0.469  0.338  0.057 |
## 2                    | -0.122  0.019  0.013 | -0.028  0.001  0.001 |
## 3                    | -0.228  0.066  0.070 | -0.087  0.012  0.010 |
## 4                    | -0.385  0.187  0.171 |  0.057  0.005  0.004 |
## 5                    | -0.228  0.066  0.070 | -0.087  0.012  0.010 |
## 6                    | -0.404  0.206  0.074 |  0.272  0.114  0.034 |
## 7                    | -0.228  0.066  0.070 | -0.087  0.012  0.010 |
## 8                    | -0.053  0.004  0.003 |  0.036  0.002  0.001 |
## 9                    |  0.629  0.501  0.250 | -0.520  0.416  0.171 |
## 10                   |  0.771  0.752  0.298 | -0.273  0.115  0.037 |
##                       Dim.3    ctr   cos2  
## 1                    -0.240  0.120  0.015 |
## 2                    -0.313  0.205  0.083 |
## 3                    -0.189  0.075  0.048 |
## 4                     0.011  0.000  0.000 |
## 5                    -0.189  0.075  0.048 |
## 6                    -0.063  0.008  0.002 |
## 7                    -0.189  0.075  0.048 |
## 8                    -0.587  0.721  0.325 |
## 9                    -0.262  0.144  0.043 |
## 10                   -0.844  1.494  0.357 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black                |   0.477   2.657   0.074   4.717 |   0.239   0.813
## Earl Grey            |  -0.243   1.795   0.106  -5.635 |  -0.218   1.768
## green                |   0.350   0.639   0.015   2.128 |   0.741   3.479
## tea bag              |  -0.574   8.857   0.431 -11.355 |   0.367   4.408
## tea bag+unpackaged   |   0.339   1.706   0.052   3.958 |  -1.017  18.690
## unpackaged           |   1.827  18.981   0.455  11.665 |   0.921   5.872
## No.sugar             |   0.245   1.466   0.064   4.375 |  -0.036   0.040
## sugar                |  -0.262   1.568   0.064  -4.375 |   0.039   0.042
## chain store          |  -0.530   8.520   0.499 -12.219 |   0.291   3.119
## chain store+tea shop |   0.507   3.163   0.090   5.193 |  -1.138  19.413
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                  0.019   2.366 |  -0.955  17.688   0.299  -9.450 |
## Earl Grey              0.086  -5.070 |   0.312   4.926   0.176   7.248 |
## green                  0.068   4.503 |   0.316   0.865   0.012   1.923 |
## tea bag                0.176   7.264 |  -0.034   0.052   0.002  -0.673 |
## tea bag+unpackaged     0.472 -11.882 |  -0.139   0.474   0.009  -1.620 |
## unpackaged             0.116   5.883 |   0.523   2.579   0.037   3.339 |
## No.sugar               0.001  -0.651 |  -0.594  14.344   0.378 -10.625 |
## sugar                  0.001   0.651 |   0.635  15.333   0.378  10.625 |
## chain store            0.150   6.704 |  -0.017   0.014   0.000  -0.384 |
## chain store+tea shop   0.455 -11.666 |  -0.347   2.455   0.042  -3.552 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## Tea                  | 0.107 0.105 0.299 |
## how                  | 0.623 0.503 0.039 |
## sugar                | 0.064 0.001 0.378 |
## where                | 0.677 0.512 0.133 |
## lunch                | 0.000 0.164 0.132 |
## exciting             | 0.019 0.013 0.181 |
## price                | 0.615 0.386 0.027 |
## Sport                | 0.004 0.052 0.084 |
# visualize MCA variables
plot(mca, invisible=c("ind"), habillage = "quali")

# Dimensions 1 and 2 of the MCA correspond mostly to how the tea is packaged, where it is bought, and its price. Dimension 3 corresponds somewhat to the kind of tea drunk and whether sugar is added.

# Visualize MCA individuals
plot(mca, invisible=c("var"), habillage = "quali")

With so many individuals it is hard to tell the numbers apart in the clouds; being able to zoom dynamically would be nice. Still, one can see, for example, that in the upper right individuals 190 and 208 are close in the plot and thus quite similar in their tea habits.
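
Since plotly is already loaded for the super-bonus plots, one way to get the wished-for zooming is to plot the individual coordinates that FactoMineR stores in mca$ind$coord interactively; a sketch (hovering then shows each individual's row name):

```r
library(FactoMineR)
library(plotly)

# redo the MCA on the same subset of tea variables
data(tea)
keep_columns <- c("Tea", "how", "sugar", "where", "lunch",
                  "exciting", "price", "Sport")
tea_time <- tea[, keep_columns]
mca <- MCA(tea_time, graph = FALSE)

# individual coordinates on the first two MCA dimensions;
# plotly supports zooming and hover labels
coords <- as.data.frame(mca$ind$coord)
plot_ly(x = coords[, 1], y = coords[, 2], type = "scatter",
        mode = "markers", text = rownames(coords))
```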